{
  "cells": [
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# Detect anomalies in Azure Monitor log data using machine learning techniques \n",
        "\n",
        "This notebook demonstrates how a user can use the [`azure-monitor-query`](https://pypi.org/project/azure-monitor-query/) library to retrieve Azure Monitor log data for training a machine learning model to detect anomalies. The [scikit-learn](https://scikit-learn.org/stable/) library is used to train two regression models on historical data, and then the trained model with better performance is used to predict new values and identify anomalies.\n",
        "\n",
        "1. [**Getting Started**](#getting-started) - Install dependencies, and define helper functions and constants.\n",
        "2. [**Query and visualize data**](#query-and-visualize) - Explore data from a Log Analytics workspace.\n",
        "3. [**Analyze data using machine learning techniques**](#analyze-data)\n",
        "   * [**Prepare data for model training**](#prepare-data) - Prepare data for model training.\n",
        "   * [**Train and test regression models**](#train-regression-models) - Train a linear regression model and a gradient boosting regression model on historical data.\n",
        "   * [**Predict new values and identify anomalies**](#identify-anomalies) - Score new data, or predict new values, using one of the trained models to identify anomalies.\n",
        "4. [**Ingest anomalies**](#ingest-anomalies) - Upload detected anomalies into a custom table in your Log Analytics workspace for further analysis. (optional)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id='getting-started'></a>\n",
        "\n",
        "## 1. Getting started\n",
        "\n",
        "Let's start by installing the Azure Monitor Query, Azure Identity and Azure Monitor Ingestion client libraries along with the `pandas` data analysis library, `plotly` visualization library, and `scikit-learn` machine learning library."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667411859
        }
      },
      "outputs": [],
      "source": [
        "import sys\n",
        "\n",
        "!{sys.executable} -m pip install --upgrade azure-monitor-query azure-identity azure-monitor-ingestion\n",
        "\n",
        "!{sys.executable} -m pip install --upgrade pandas numpy plotly scikit-learn nbformat"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Setup\n",
        "\n",
        "Some initial setup is needed before we can run the sample code.\n",
        "\n",
        "#### Set Log Analytics workspace ID\n",
        "\n",
        "Set the `LOGS_WORKSPACE_ID` variable below to the ID of your Log Analytics workspace. Currently, it is set to use the [Azure Monitor Demo workspace](https://portal.azure.com/#blade/Microsoft_Azure_Monitoring_Logs/DemoLogsBlade), but it is recommended to use your own workspace if available."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667412121
        }
      },
      "outputs": [],
      "source": [
        "LOGS_WORKSPACE_ID = \"DEMO_WORKSPACE\""
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Create LogsQueryClient\n",
        "\n",
        "An authenticated client is needed to query Azure Monitor Logs. The following code shows how to create a `LogsQueryClient` using `DefaultAzureCredential`.\n",
        "\n",
        "Note, that `LogsQueryClient` typically only supports authentication with Microsoft Entra ID token credentials. However, we can pass in a custom authentication policy to enable the use of API keys. This allows the client to query the [demo workspace](https://learn.microsoft.com/azure/azure-monitor/logs/api/access-api#authenticate-with-a-demo-api-key). Do note that the availability and access to this demo workspace is subject to change, and it is recommended to use your own Log Analytics workspace."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667412326
        }
      },
      "outputs": [],
      "source": [
        "from azure.core.credentials import AzureKeyCredential\n",
        "from azure.core.pipeline.policies import AzureKeyCredentialPolicy\n",
        "from azure.identity import DefaultAzureCredential\n",
        "from azure.monitor.query import LogsQueryClient\n",
        "\n",
        "if LOGS_WORKSPACE_ID == \"DEMO_WORKSPACE\":\n",
        "    credential = AzureKeyCredential(\"DEMO_KEY\")\n",
        "    header_name = \"X-Api-Key\"\n",
        "    authentication_policy = AzureKeyCredentialPolicy(name=header_name, credential=credential)\n",
        "else:\n",
        "    credential = DefaultAzureCredential()\n",
        "    authentication_policy = None\n",
        "\n",
        "logs_query_client = LogsQueryClient(credential, authentication_policy=authentication_policy)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Define helper functions\n",
        "\n",
        "Next, we'll define some helper functions that will be used throughout the notebook.\n",
        "\n",
        "- `query_logs_workspace` - Queries the Log Analytics workspace for a given query and returns the results as a `pandas` DataFrame.\n",
        "- `display_graph` - Given a `pandas` DataFrame, displays a `plotly` line graph showing hourly usage for various data types over time."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667412535
        }
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "import plotly.express as px\n",
        "\n",
        "from azure.monitor.query import LogsQueryStatus\n",
        "from azure.core.exceptions import HttpResponseError\n",
        "\n",
        "\n",
        "def query_logs_workspace(query):\n",
        "    try:\n",
        "        response = logs_query_client.query_workspace(LOGS_WORKSPACE_ID, query, timespan=None)\n",
        "        if response.status == LogsQueryStatus.SUCCESS:\n",
        "            data = response.tables\n",
        "        else:\n",
        "            error = response.partial_error\n",
        "            data = response.partial_data\n",
        "            print(error)\n",
        "\n",
        "        for table in data:\n",
        "            my_data = pd.DataFrame(data=table.rows, columns=table.columns)\n",
        "    except HttpResponseError as err:\n",
        "        print(\"something fatal happened\")\n",
        "        print (err)\n",
        "    return my_data\n",
        "\n",
        "\n",
        "def display_graph(df, title):\n",
        "    df = df.sort_values(by=\"TimeGenerated\")\n",
        "    graph = px.line(df, x='TimeGenerated', y=\"ActualUsage\", color='DataType', title=title)\n",
        "    graph.show()\n",
        "\n",
        "\n",
        "# Set display options for visualizing\n",
        "def display_options():\n",
        "    display = pd.options.display\n",
        "    display.max_columns = 10\n",
        "    display.max_rows = 10\n",
        "    display.max_colwidth = 300\n",
        "    display.width = None\n",
        "    return None\n",
        "\n",
        "display_options()\n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id='query-and-visualize'></a>\n",
        "\n",
        "## 2. Query and visualize data\n",
        "\n",
        "Let's start by exploring the data in the Log Analytics workspace. We'll start by running the following query on the [Usage](https://learn.microsoft.com/azure/azure-monitor/reference/tables/usage) table which is assumed to exist inside the workspace. \n",
        "\n",
        "This query will check how much data (in Megabytes) was ingested into each of the tables (data types) in the Log Analytics workspace each hour over the past week."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667412978
        }
      },
      "outputs": [],
      "source": [
        "TABLE = \"Usage\"\n",
        "\n",
        "QUERY = f\"\"\"\n",
        "let starttime = 7d; // Start date for the time series, counting back from the current date\n",
        "let endtime = 0d; // today\n",
        "{TABLE} | project TimeGenerated, DataType, Quantity\n",
        "| where TimeGenerated between (ago(starttime)..ago(endtime))\n",
        "| summarize ActualUsage=sum(Quantity) by TimeGenerated=bin(TimeGenerated, 1h), DataType\n",
        "\"\"\"\n",
        "\n",
        "df = query_logs_workspace(QUERY)\n",
        "display(df)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now, let's view the data as a graph using the helper function we defined above."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667416120
        }
      },
      "outputs": [],
      "source": [
        "display_graph(df, \"All Data Types - last week usage\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id='analyze-data'></a>\n",
        "\n",
        "## 3. Analyze data using machine learning techniques\n",
        "\n",
        "<a id='prepare-data'></a>\n",
        "### Prepare data for model training\n",
        "\n",
        "After exploring the available data, let's use a subset of it for model training. We will choose a few of the data types to train our model on (defined in `data_types` below). "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667416435
        }
      },
      "outputs": [],
      "source": [
        "# Insert here the selected data types for analysis - for simplicity we picked 6, which seemed most interesting at exploration of data step\n",
        "data_types = [\"ContainerLog\", \"AzureNetworkAnalytics_CL\", \"StorageBlobLogs\", \"AzureDiagnostics\", \"Perf\", \"AVSSyslog\"]\n",
        "\n",
        "# Get all available data types that have data.\n",
        "available_data_types = df[\"DataType\"].unique()\n",
        "\n",
        "# Filter out data types that are not available in the data.\n",
        "data_types = list(filter(lambda data_type: data_type in available_data_types, data_types))\n",
        "\n",
        "if data_types:\n",
        "    print(f\"Selected data type for analysis: {data_types}\")\n",
        "else:\n",
        "    raise SystemExit(\"No datatypes found. Please select data types which have data\")\n",
        "\n",
        "# Returns usage query for selected data types for given time range\n",
        "def get_selected_datatypes(data_types, start, end):\n",
        "    data_types_string = \",\".join([f\"'{data_type}'\" for data_type in data_types])\n",
        "    query = (\n",
        "        f\"let starttime = {start}d; \"\n",
        "        f\"let endtime = {end}d; \"\n",
        "        \"Usage | project TimeGenerated, DataType, Quantity \"\n",
        "        \"| where TimeGenerated between (ago(starttime)..ago(endtime)) \"\n",
        "        f\"| where DataType in ({data_types_string}) \"\n",
        "        \"| summarize ActualUsage=sum(Quantity) by TimeGenerated=bin(TimeGenerated, 1h), DataType\"\n",
        "    )\n",
        "    return query\n",
        "\n",
        "# We will query the data from the first 3 weeks of the past month.\n",
        "# Feel free to change the start and end dates.\n",
        "start = 28\n",
        "end = 7\n",
        "\n",
        "query = get_selected_datatypes(data_types, start, end)\n",
        "my_data = query_logs_workspace(query)\n",
        "display(my_data)\n",
        "\n",
        "if my_data.empty:\n",
        "    raise SystemExit(\"No data found for training. Please select data types which have data\")\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667416744
        }
      },
      "outputs": [],
      "source": [
        "display_graph(my_data, \"Selected Data Types - Historical Data Usage (3 weeks)\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Let's now expand the timestamp information in the TimeGenerated field into separate columns for year, month, day, and hour using [`DatetimeIndex`](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components) from `pandas`. This will allow us to use the timestamp information as features in our model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667416948
        }
      },
      "outputs": [],
      "source": [
        "my_data['Year'] = pd.DatetimeIndex(my_data['TimeGenerated']).year\n",
        "my_data['Month'] = pd.DatetimeIndex(my_data['TimeGenerated']).month\n",
        "my_data['Day'] = pd.DatetimeIndex(my_data['TimeGenerated']).day\n",
        "my_data['Hour'] = pd.DatetimeIndex(my_data['TimeGenerated']).hour"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Finally, define the X and y variables for training the model. The X variable will contain the features (timestamp information) and the y variable will contain the target (data usage in Megabytes)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667417190
        }
      },
      "outputs": [],
      "source": [
        "Y = my_data['ActualUsage']\n",
        "X = my_data[['DataType', 'Year', 'Month', 'Day', 'Hour']]\n",
        "\n",
        "display(X)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"train-regression-models\"></a>\n",
        "\n",
        "### Train and test regression models on historical data\n",
        "\n",
        "Now that we have our data prepared, let's experiment with two different regression models and check which of the models most closely predicts the data in our testing set:\n",
        "\n",
        "#### Define cross validator\n",
        "\n",
        "Before we train, we'll define a cross-validator using [`TimeSeriesSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html#sklearn.model_selection.TimeSeriesSplit) from `scikit-learn`.  The `evaluate` function defined below will use this cross-validator to evaluate the performance of the models we train.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667417450
        }
      },
      "outputs": [],
      "source": [
        "from sklearn.model_selection import cross_validate\n",
        "from sklearn.model_selection import TimeSeriesSplit\n",
        "\n",
        "ts_cv = TimeSeriesSplit()\n",
        "\n",
        "def evaluate(model, X, Y, cv):\n",
        "    cv_results = cross_validate(\n",
        "        model,\n",
        "        X,\n",
        "        Y,\n",
        "        cv=cv,\n",
        "        scoring=[\"neg_mean_absolute_error\", \"neg_root_mean_squared_error\"],\n",
        "    )\n",
        "    mae = -cv_results[\"test_neg_mean_absolute_error\"]\n",
        "    rmse = -cv_results[\"test_neg_root_mean_squared_error\"]\n",
        "    print(\n",
        "        f\"Mean Absolute Error:     {mae.mean():.3f} +/- {mae.std():.3f}\\n\"\n",
        "        f\"Root Mean Squared Error: {rmse.mean():.3f} +/- {rmse.std():.3f}\"\n",
        "    )"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n",
        "#### Train and evaluate a linear regression model\n",
        "\n",
        "First, let's train a linear regression model.\n",
        "\n",
        "Here, we first apply some transformations to the input data:\n",
        "\n",
        "* One-hot encode the categorical features using [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder). This is how we numerically represent \"DataTypes\" in our model.\n",
        "* Scales numerical features - in our case, hourly usage - to the 0-1 range.\n",
        "\n",
        "Then, we train the model using an extension of Linear regression called [Ridge Regression](https://en.wikipedia.org/wiki/Ridge_regression). This is a linear regression model that uses L2 [regularization](https://en.wikipedia.org/wiki/Regularization_(mathematics)) to prevent overfitting.\n",
        "\n",
        "Finally, we evaluate the model using the cross-validator defined above."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683669757397
        }
      },
      "outputs": [],
      "source": [
        "from sklearn.pipeline import make_pipeline\n",
        "from sklearn.compose import ColumnTransformer\n",
        "from sklearn.preprocessing import OneHotEncoder\n",
        "from sklearn.preprocessing import MinMaxScaler\n",
        "from sklearn.linear_model import RidgeCV\n",
        "import numpy as np\n",
        "\n",
        "\n",
        "categorical_columns = [\"DataType\"]\n",
        "\n",
        "one_hot_encoder = OneHotEncoder(handle_unknown=\"ignore\", sparse_output=False)\n",
        "\n",
        "# Get 25 alpha values between 10^-6 and 10^6\n",
        "alphas = np.logspace(-6, 6, 25)\n",
        "ridge_linear_pipeline = make_pipeline(\n",
        "    ColumnTransformer(\n",
        "        transformers=[\n",
        "            (\"categorical\", one_hot_encoder, categorical_columns),\n",
        "        ],\n",
        "        remainder=MinMaxScaler(),\n",
        "    ),\n",
        "    RidgeCV(alphas=alphas),\n",
        ")\n",
        "\n",
        "ridge_linear_pipeline.fit(X, Y)\n",
        "\n",
        "print(\"Score of Linear Regression:\")\n",
        "evaluate(ridge_linear_pipeline, X, Y, cv=ts_cv)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Train and evaluate a gradient boosting regression model\n",
        "\n",
        "Next, let's train a gradient boosting regression model. Here, we'll use [`HistGradientBoostingRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html#sklearn.ensemble.HistGradientBoostingRegressor) from `scikit-learn`. We will do ordinal encoding of the categorical features using [`OrdinalEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667417902
        }
      },
      "outputs": [],
      "source": [
        "from sklearn.preprocessing import OrdinalEncoder\n",
        "from sklearn.ensemble import HistGradientBoostingRegressor\n",
        "\n",
        "\n",
        "ordinal_encoder = OrdinalEncoder(categories=[data_types])\n",
        "\n",
        "gradient_boosting_pipeline = make_pipeline(\n",
        "    ColumnTransformer(\n",
        "        transformers=[\n",
        "            (\"categorical\", ordinal_encoder, categorical_columns),\n",
        "        ],\n",
        "        remainder=\"passthrough\",\n",
        "    ),\n",
        "    HistGradientBoostingRegressor(\n",
        "        categorical_features=range(1),\n",
        "    ),\n",
        ")\n",
        "\n",
        "gradient_boosting_pipeline.fit(X, Y)\n",
        "print(\"Score of Gradient Boosting Regression:\")\n",
        "evaluate(gradient_boosting_pipeline, X, Y, cv=ts_cv)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Take a look at the error metrics for both models. Which model performs better?\n",
        "\n",
        "Typically, for this dataset, the gradient boosting regression model will perform better than the linear regression model based on the lower error metrics. Since the gradient boosting regression model performs better, we'll use it to predict new values and identify anomalies."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Save the model\n",
        "\n",
        "First, we need to pickle the model so that we can use it later."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667418162
        }
      },
      "outputs": [],
      "source": [
        "import joblib\n",
        "\n",
        "# Save the model as a pickle file\n",
        "filename = './myModel.pkl'\n",
        "joblib.dump(gradient_boosting_pipeline, filename)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"identify-anomalies\"></a>\n",
        "\n",
        "### Predict new values and identify anomalies\n",
        "\n",
        "Now that we have a trained model, let's use it to predict new values and identify anomalies. Let's start by querying ingestion information for the six data types we selected over the past week."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667418434
        }
      },
      "outputs": [],
      "source": [
        "# Time range from past week.\n",
        "start = 7\n",
        "end = 0\n",
        "\n",
        "query = get_selected_datatypes(data_types, start, end)\n",
        "new_data = query_logs_workspace(query)\n",
        "\n",
        "new_data['Year'] = pd.DatetimeIndex(new_data['TimeGenerated']).year\n",
        "new_data['Month'] = pd.DatetimeIndex(new_data['TimeGenerated']).month\n",
        "new_data['Day'] = pd.DatetimeIndex(new_data['TimeGenerated']).day\n",
        "new_data['Hour'] = pd.DatetimeIndex(new_data['TimeGenerated']).hour\n",
        "display(new_data)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Visualize the data in a graph:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667418762
        }
      },
      "outputs": [],
      "source": [
        "display_graph(new_data, \"Selected Data Types - New Data Usage (1 week)\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Next, load the model from the pickle file and use it to predict (score) values for the latest data."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667418994
        }
      },
      "outputs": [],
      "source": [
        "# Load the model from the file\n",
        "X_new = new_data[['DataType', 'Year', 'Month', 'Day', 'Hour']]\n",
        "\n",
        "loaded_model = joblib.load(filename)\n",
        "Predictions_new = loaded_model.predict(X_new)\n",
        "new_data[\"PredictedUsage\"] = Predictions_new\n",
        "display(new_data)\n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "From the displayed DataFrame, you should see an additional column called \"PredictedUsage\" which contains the predicted usage values."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Identify ingestion anomalies\n",
        "\n",
        "Let's now try to identify anomalies. There are multiple approaches to identifying anomalies, but, for this sample, we'll use a method call [Tukey's fences](https://en.wikipedia.org/wiki/Outlier#Tukey%27s_fences).\n",
        "\n",
        "Note: The KQL [series_decompose_anomalies](https://learn.microsoft.com/azure/data-explorer/kusto/query/series-decompose-anomaliesfunction) function also uses the Tukey's fences method to detect anomalies."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "#### Define helper functions\n",
        "\n",
        "Let's define a couple of helper function that will help us identify anomalies. These will update a DataFrame with a new column called `Anomalies` where `1` indicates a positive anomaly, and `-1` indicates a negative anomaly."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667419166
        }
      },
      "outputs": [],
      "source": [
        "def outlier_range(data_column):\n",
        "    sorted(data_column)\n",
        "    Q1, Q3 = np.percentile(data_column , [10,90])\n",
        "    IQR = Q3 - Q1\n",
        "    lower_bound = Q1 - (1.5 * IQR)\n",
        "    upper_bound = Q3 + (1.5 * IQR)\n",
        "    return lower_bound, upper_bound\n",
        "\n",
        "def outlier_update_data_frame(df):\n",
        "    lower_bound, upper_bound = outlier_range(df['Residual'])\n",
        "\n",
        "    df.loc[((df['Residual'] < lower_bound) | (df['Residual'] > upper_bound)) & (df['Residual'] < 0) , 'Anomalies'] = -1\n",
        "    df.loc[((df['Residual'] < lower_bound) | (df['Residual'] > upper_bound)) & (df['Residual'] >= 0) , 'Anomalies'] = 1\n",
        "    df.loc[(df['Residual'] >= lower_bound) & (df['Residual'] <= upper_bound), 'Anomalies'] = 0\n",
        "\n",
        "    return df[df['Anomalies'] != 0]"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Run the helper functions on the DataFrame to identify anomalies in the new data:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667419398
        }
      },
      "outputs": [],
      "source": [
        "\n",
        "new_data[\"Residual\"] = new_data[\"ActualUsage\"] - new_data[\"PredictedUsage\"]\n",
        "new_data_datatypes = new_data[\"DataType\"].unique()\n",
        "\n",
        "new_data.set_index('DataType', inplace=True)\n",
        "\n",
        "anomalies_df = pd.DataFrame()\n",
        "for data_type in new_data_datatypes:\n",
        "    type_anomalies = outlier_update_data_frame(new_data.loc[data_type, :])\n",
        "    # Add DataType as a column since we reset index later on\n",
        "    type_anomalies['DataType'] = data_type\n",
        "    anomalies_df = pd.concat([anomalies_df, type_anomalies], ignore_index=True)\n",
        "\n",
        "new_data.reset_index(inplace=True)\n",
        "\n",
        "print(f\"{len(anomalies_df)} anomalies detected\")\n",
        "display(anomalies_df)"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "<a id=\"ingest-anomalies\"></a>\n",
        "\n",
        "## 4. Ingest anomalies (optional)\n",
        "\n",
        "Optionally, we can upload detected anomalies to a custom table in a Log Analytics workspace. This can be useful for further analysis or visualization.\n",
        "\n",
        "To send data to your Log Analytics workspace, you need a registered Microsoft Entra ID application, custom table, data collection endpoint (DCE), and data collection rule (DCR). You also need to assign permissions to data collection rule so that the Azure AD application can upload.\n",
        "\n",
        "Use the following tutorial for specifics on creating the prerequisites: [Tutorial: Send data to Azure Monitor Logs with Logs ingestion API (Azure portal) ](https://learn.microsoft.com/azure/azure-monitor/logs/tutorial-logs-ingestion-portal)."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "When creating the table for the custom logs, use the JSON file created in the following cell when asked to upload a sample JSON file:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667419568
        }
      },
      "outputs": [],
      "source": [
        "import json\n",
        "\n",
        "sample_data = [{\n",
        "    \"TimeGenerated\": \"2023-03-19T19:56:43.7447391Z\",\n",
        "    \"ActualUsage\": 40.1,\n",
        "    \"PredictedUsage\": 45.1,\n",
        "    \"Anomalies\": -1,\n",
        "    \"DataType\": \"AzureDiagnostics\"\n",
        "}]\n",
        "\n",
        "with open(\"data_sample.json\", \"w\") as file:\n",
        "    json.dump(sample_data, file)\n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Then use the following in the `Transformation Editor`:\n",
        "\n",
        "`source | extend AnomalyTimeGenerated = todatetime(TimeGenerated) | extend TimeGenerated = now() `\n",
        "\n",
        "This will add a transformation so that `AnomalyTimeGenerated` indicates the time when the anomaly was detected and `TimeGenerated` indicates the time when the anomaly was uploaded to the custom table."
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Define constants\n",
        "\n",
        "Define constants for your Azure AD application and DCR/DCE information."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667419730
        }
      },
      "outputs": [],
      "source": [
        "AZURE_TENANT_ID = \"<Tenant ID>\"; # ID of the tenant where the data collection endpoint resides\n",
        "AZURE_CLIENT_ID = \"<Application ID>\"; # Application ID to which you granted permissions to your data collection rule\n",
        "AZURE_CLIENT_SECRET = \"<Client secret>\"; # Secret created for the application\n",
        "\n",
        "LOGS_DCR_STREAM_NAME = \"<Custom stream name>\" # Name of the custom stream from the data collection rule (e.g. \"Custom-DetectedAnomalies_CL\")\n",
        "LOGS_DCR_RULE_ID = \"<Data collection rule immutableId>\" # immutableId of your data collection rule (Can be found in the JSON View of the data collection rule overview page)\n",
        "DATA_COLLECTION_ENDPOINT =  \"<Logs ingestion URL of your endpoint>\" # URL that looks like this: https://xxxx.ingest.monitor.azure.com"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Ingest the data\n",
        "\n",
        "After creating the table and Data collection rule, you can use the following code to ingest the data into the custom table.\n",
        "\n",
        "**Note:** After creating the table, it can take up to 15 minutes for the table to be available for ingestion through the DCR stream."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "gather": {
          "logged": 1683667420147
        }
      },
      "outputs": [],
      "source": [
        "from azure.core.exceptions import HttpResponseError\n",
        "from azure.identity import ClientSecretCredential\n",
        "from azure.monitor.ingestion import LogsIngestionClient\n",
        "\n",
        "\n",
        "credential = ClientSecretCredential(\n",
        "    tenant_id=AZURE_TENANT_ID,\n",
        "    client_id=AZURE_CLIENT_ID,\n",
        "    client_secret=AZURE_CLIENT_SECRET\n",
        ")\n",
        "\n",
        "client = LogsIngestionClient(endpoint=DATA_COLLECTION_ENDPOINT, credential=credential, logging_enable=True)\n",
        "\n",
        "body = json.loads(anomalies_df.to_json(orient='records', date_format='iso'))\n",
        "\n",
        "try:\n",
        "   response = client.upload(rule_id=LOGS_DCR_RULE_ID, stream_name=LOGS_DCR_STREAM_NAME, logs=body)\n",
        "   print(\"Upload request accepted\")\n",
        "except HttpResponseError as e:\n",
        "    print(f\"Upload failed: {e}\")\n",
        "\n"
      ]
    }
  ],
  "metadata": {
    "kernel_info": {
      "name": "python3"
    },
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.11.1"
    },
    "microsoft": {
      "host": {
        "AzureML": {
          "notebookHasBeenCompleted": true
        }
      },
      "ms_spell_check": {
        "ms_spell_check_language": "en"
      }
    },
    "orig_nbformat": 4
  },
  "nbformat": 4,
  "nbformat_minor": 2
}
